Introduction

In today’s data-driven world, job roles related to data are not only proliferating but also evolving at a rapid pace. From data scientists and data engineers to analysts and architects, the spectrum of data-related professions continues to expand. As these roles become more nuanced and specialized, determining a competitive salary becomes a challenge for both employers and potential employees.

Enter “Glassdoor’s Data Jobs- Salary Predictor” - our state-of-the-art predictive model designed to address this very challenge. Leveraging vast amounts of salary data available on Glassdoor, one of the world’s largest job and recruiting platforms, our model hopes to accurate salary predictions for various data-centric job roles.

In the following sections, we will delve deeper into the mechanics of our model, the data that powers it, and the actionable insights it can provide. Let’s embark on this journey to demystify the world of data job salaries on Glassdoor.

Objectives

  1. Objective 1: Build a model with the main goal to identify key relationships between the average salary and the job descriptions, such as the required skills, job titles, job locations, etc. and more importantly, is highly interpretable.

  2. Objective 2: Build multiple models with the goal of developing a model that can predict the best and do well on future data.

Data Descriptions

Our data set was from Kaggle. It included 742 job postings in 2021, which scraped from Glassdoor using a Selenium-based scraper. After scrapping, the raw data set was split into 42 columns, including the original job posts.

library

#Libraries 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(caret)   
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(naniar)

Data Processing

Input the dataset from Kaggle

setwd("~/Desktop/MSDS/Term2_Fall2023/Stat2/Projects/Project1/R")
data<-read.csv("data_cleaned_2021.csv")

#Checking data types
str(data)
## 'data.frame':    742 obs. of  42 variables:
##  $ index             : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Job.Title         : chr  "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
##  $ Salary.Estimate   : chr  "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
##  $ Job.Description   : chr  "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
##  $ Rating            : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Company.Name      : chr  "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
##  $ Location          : chr  "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Headquarters      : chr  "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Size              : chr  "501 - 1000 " "10000+ " "501 - 1000 " "1001 - 5000 " ...
##  $ Founded           : int  1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
##  $ Type.of.ownership : chr  "Company - Private" "Other Organization" "Company - Private" "Government" ...
##  $ Industry          : chr  "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
##  $ Sector            : chr  "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
##  $ Revenue           : chr  "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
##  $ Competitors       : chr  "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
##  $ Hourly            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Employer.provided : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Lower.Salary      : int  53 63 80 56 86 71 54 86 38 120 ...
##  $ Upper.Salary      : int  91 112 90 97 143 119 93 142 84 160 ...
##  $ Avg.Salary.K.     : num  72 87.5 85 76.5 114.5 ...
##  $ company_txt       : chr  "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
##  $ Job.Location      : chr  "NM" "MD" "FL" "WA" ...
##  $ Age               : int  48 37 11 56 23 21 13 16 7 12 ...
##  $ Python            : int  1 1 1 1 1 1 0 1 0 1 ...
##  $ spark             : int  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws               : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel             : int  1 0 1 0 1 1 1 1 0 0 ...
##  $ sql               : int  0 0 1 0 1 1 0 1 0 0 ...
##  $ sas               : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ keras             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pytorch           : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ scikit            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tensor            : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ hadoop            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tableau           : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ bi                : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ flink             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mongo             : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ google_an         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title_sim     : chr  "data scientist" "data scientist" "data scientist" "data scientist" ...
##  $ seniority_by_title: chr  "na" "na" "na" "na" ...
##  $ Degree            : chr  "M" "M" "M" "na" ...
vis_miss(data)

dim(data)
## [1] 742  42

The summary satistics of the original dataset

summary(data)
##      index        Job.Title         Salary.Estimate    Job.Description   
##  Min.   :  0.0   Length:742         Length:742         Length:742        
##  1st Qu.:221.5   Class :character   Class :character   Class :character  
##  Median :472.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :469.1                                                           
##  3rd Qu.:707.8                                                           
##  Max.   :955.0                                                           
##      Rating       Company.Name         Location         Headquarters      
##  Min.   :-1.000   Length:742         Length:742         Length:742        
##  1st Qu.: 3.300   Class :character   Class :character   Class :character  
##  Median : 3.700   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 3.619                                                           
##  3rd Qu.: 4.000                                                           
##  Max.   : 5.000                                                           
##      Size              Founded     Type.of.ownership    Industry        
##  Length:742         Min.   :  -1   Length:742         Length:742        
##  Class :character   1st Qu.:1939   Class :character   Class :character  
##  Mode  :character   Median :1988   Mode  :character   Mode  :character  
##                     Mean   :1837                                        
##                     3rd Qu.:2007                                        
##                     Max.   :2019                                        
##     Sector            Revenue          Competitors            Hourly       
##  Length:742         Length:742         Length:742         Min.   :0.00000  
##  Class :character   Class :character   Class :character   1st Qu.:0.00000  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.00000  
##                                                           Mean   :0.03234  
##                                                           3rd Qu.:0.00000  
##                                                           Max.   :1.00000  
##  Employer.provided  Lower.Salary     Upper.Salary   Avg.Salary.K.  
##  Min.   :0.00000   Min.   : 15.00   Min.   : 16.0   Min.   : 15.5  
##  1st Qu.:0.00000   1st Qu.: 52.00   1st Qu.: 96.0   1st Qu.: 73.5  
##  Median :0.00000   Median : 69.50   Median :124.0   Median : 97.5  
##  Mean   :0.02291   Mean   : 74.75   Mean   :128.2   Mean   :101.5  
##  3rd Qu.:0.00000   3rd Qu.: 91.00   3rd Qu.:155.0   3rd Qu.:122.5  
##  Max.   :1.00000   Max.   :202.00   Max.   :306.0   Max.   :254.0  
##  company_txt        Job.Location            Age             Python      
##  Length:742         Length:742         Min.   : -1.00   Min.   :0.0000  
##  Class :character   Class :character   1st Qu.: 12.00   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median : 25.00   Median :1.0000  
##                                        Mean   : 47.52   Mean   :0.5283  
##                                        3rd Qu.: 60.00   3rd Qu.:1.0000  
##                                        Max.   :277.00   Max.   :1.0000  
##      spark             aws             excel             sql        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.2251   Mean   :0.2372   Mean   :0.5229   Mean   :0.5121  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       sas              keras            pytorch            scikit       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.08895   Mean   :0.03908   Mean   :0.05256   Mean   :0.07278  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##      tensor            hadoop          tableau             bi         
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.09703   Mean   :0.1671   Mean   :0.1995   Mean   :0.07547  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##      flink             mongo           google_an       job_title_sim     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Length:742        
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   Class :character  
##  Median :0.00000   Median :0.00000   Median :0.00000   Mode  :character  
##  Mean   :0.01348   Mean   :0.04987   Mean   :0.01887                     
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000                     
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000                     
##  seniority_by_title    Degree         
##  Length:742         Length:742        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Data cleaning - Removing some of the columns that we dont need:

After evaluating all of the variables,we decided to drop some of them because of their existence lead to the redundancy of other variables. For example, the job title was a detailed job title from the job postings. Most of them are unique and would not provide a meaningful prediction.

The Salary Estimates is the salary ranged provided in the job listings, based on the Salary estimates, we got the upper and the slower salary and thus, find the average salary by dividing the sum of the upper and the slower salary by 2. That is why we removed all of the variables relating to the salary except the average salary. We also removed the other variables on the list for the same reason.

We moved forward with 28 variables. Following are the variable that were dropped from all data set. 1 - index 2 - Job.Title (the same as job_title_sim) 3 - Salary Estimate 4 - Job Description (Glassdor job posts) 6 - Company Name 7 - Location 8 - Headquarter 10 - Founded (Year that company was founded) 12 - Industry 15 - Competitors 17 - Employer Provided 18 - Lower Salary 19 - Upper Salary 21 - Company text

data <- data[ , -c(1,2,3,4,6,7,8,10,12,15,17,18,19,21)]

Checking for the missing values

#Find the na values 
columns_with_na <- names(data)[colSums(is.na(data)) > 0]
# Display the column names with NA values
print(columns_with_na)
## character(0)

There is no null values based on the initial checking point.

Factoring the categorical variables

##Factorize the categorized variables
data <- data %>% mutate_if(is.character, factor)

Checking the missing values of the job titles and then replace na value with more meaningful variable

# Replace "na" levels with "not specified" using fct_recode()
data$job_title_sim <- fct_recode(data$job_title_sim, "other data related jobs" = "na")

There are 10 variables that the job_stitle_sim are “na”. Upon checking the original data sets, these job postings does relate to data analysis even though their job titles do not contain “data”. We decided to keep the data points but create a new level for the job title, called “other data related jobs.

Checking and cleaning up the Sector variable

# Replace "-1" levels with "not specified" using fct_recode()
data$Sector <- fct_recode(data$Sector, "not specified" = "-1")

There are 10 variables that the sector are “na”. We decided to keep the data points but create a new level for missing sector, called “not specified”.

Rename column titles

# Assuming 'data' is your data frame
# Use colnames() to change the column name
colnames(data)[colnames(data) == "Type.of.ownership"] <- "Ownership_type"
colnames(data)[colnames(data) == "Age"] <- "Comp_history"
colnames(data)[colnames(data) == "Avg.Salary.K."] <- "AvgSalary"
colnames(data)[colnames(data) == "job_title_sim"] <- "Job_title"
colnames(data)[colnames(data) == "seniority_by_title"] <- "Experience"

Replace the na Experience as “no experience specified”

# Replace "-1" levels with "not specified" using fct_recode()
data$Experience <- fct_recode(data$Experience, "no experience required" = "na")

There are 519 job postings do not mention experiences. We assume that job does not limit applicants from a specific level of experience. So we simply replace the “na” value with “no experience required”.

EDA

#Corrplot

# Select only numeric columns from the data frame
numeric_data <- select_if(data, is.numeric)

# Compute the correlation matrix for numeric variables
cor_matrix <- cor(numeric_data)

# Customize corrplot (e.g., change color, method, etc.)
# For more customization options, check the documentation: ?corrplot
corrplot(cor_matrix, method = "color", tl.cex = 0.7)

Notably, keras, pytorch, scikit and tensor has high multi-collinearity. This is expected as keras and pytorch are the features built on top of tensor flow. Therefore, if a job require one of these skills, there are high chance that the employee will also need to be familiar with the other skills to do their jobs efficiently. To reduce the dependency between these variables, we combine it as one variable called “ML_skills”. Similarly, we combined Hadoop and spark as “data_framework”; tableau and bi were combined as data_visualization.

# Create a new column by summing the four columns

data$ML_skills <- data$keras + data$pytorch + data$scikit+ data$tensor

data$vizualization <- ifelse(data$bi == 1 & data$tableau == 1, "both",
               ifelse(data$bi == 1 & data$tableau == 0, "bi",
               ifelse(data$bi == 0 & data$tableau == 1, "tableau", "none")))



data$data_framework <- ifelse(data$spark == 1 & data$hadoop == 1, "both",
               ifelse(data$spark == 1 & data$hadoop == 0, "spark",
               ifelse(data$spark == 0 & data$hadoop == 1, "hadoop", "none")))

Removing the following columns

11 - spark
16 - pytorch 17 - keras 18 - scikit 19 - tensorflow 20 - tableau 21 - bi 22 - hadoop

We removed the original variables of the skills mentioned above.

# data <- data[ , -c(25,30,31,32,33,34,35,36)]
data <- data[ , -c(11,16,17,18,19,20,21,22)]

#Checking multicollinearity

#Corrplot

# Select only numeric columns from the data frame
numeric_data <- select_if(data, is.numeric)

# Compute the correlation matrix for numeric variables
cor_matrix <- cor(numeric_data)

# Customize corrplot (e.g., change color, method, etc.)
# For more customization options, check the documentation: ?corrplot
corrplot(cor_matrix, method = "color", tl.cex = 0.7)

Notice now the multicollinearity is improved. We noticed that among the numerical variables, the highest correlation is between the average salary and Python (0.324)

#Display the counts of each level of the categorial varibles

for (col in names(data)) {
  if (is.factor(data[[col]])) {
    cat("Variable:", col, "\n")
    cat(table(data[[col]]), "\n\n")
  }
}
## Variable: Size 
## 31 130 150 117 76 134 94 10 
## 
## Variable: Ownership_type 
## 13 410 193 15 15 55 5 2 34 
## 
## Variable: Sector 
## 10 1 25 1 4 112 97 3 4 23 42 11 49 180 69 34 6 3 9 14 8 15 6 8 8 
## 
## Variable: Revenue 
## 60 8 32 124 91 39 40 19 18 46 57 4 204 
## 
## Variable: Job.Location 
## 8 9 152 11 5 11 6 16 6 5 2 40 10 3 6 4 103 35 6 2 9 21 4 17 3 72 14 4 33 1 1 13 28 10 41 21 10 
## 
## Variable: Job_title 
## 101 8 119 5 313 16 5 22 10 143 
## 
## Variable: Experience 
## 3 519 220 
## 
## Variable: Degree 
## 252 383 107

Remove no rating datapoints

# Remove rows where Rating is -1 using logical indexing
data <- data[data$AvgSalary != 15.5, ]

#Replace the not Specific Company Age as Median
# Calculate the median of comp_age excluding -1 values
median_age <- median(data$Comp_history[data$Comp_history != -1])
# Replace -1 with the calculated median
data$Comp_history[data$Comp_history == -1] <- median_age
median_age
## [1] 29

Upon checking the outliers, we removed some of the outliers and missing variables. We removed the job positing from NYPD which is a data scientist full time job that only require excel skills and the average salary is 15.5K annually. Compared to the cost of living in New York an the job is full time, we believe there must be some error in this job postings. We don’t think it is possible for a job like this exist with this average salary.

We also removed the job that has rating is -1 and company size is unknown, because they don’t provide any information except the salary.

Checking the variables

##Factorize the categorized variables
data <- data %>% mutate_if(is.character, factor)

Histogram of the Average Salary

Right-skewed as expected for salary.

hist(data$AvgSalary, breaks = 20, col = "lightblue", xlab = "Average Salary", main = "Histogram of AvgSalary")

Compared the scatter plots

#Rating vs Average Salary by the Job titles 
# Scatter Plot- Original Data
ggplot(data, aes(x=Rating, y = AvgSalary, color= Job_title)) + 
  geom_point() +
  labs(title = "Scatter Plot of Rating vs. Average Salary")

# Scatter Plot Linear -Log
ggplot(data, aes(x=Rating, y = log(AvgSalary), color= Job_title)) + 
  geom_point() +
  labs(title = "Scatter Plot of Rating vs. Average Salary")

# Company Age vs. Average Salary by the Job titles 
# Scatter Plot - Original Data 
ggplot(data, aes(x=Comp_history, y = AvgSalary, color=Ownership_type)) + 
  geom_point() +
  labs(title = "Scatter Plot of Comp_History vs. Average Salary")

# Scatter Plot - Log- Linear
ggplot(data, aes(x=log(Comp_history), y = AvgSalary, color=Ownership_type)) + 
  geom_point() +
  labs(title = "Scatter Plot of Comp_History vs. Average Salary")

# Scatter Plot Linear-Log 
ggplot(data, aes(x=Comp_history, y = log(AvgSalary), color=Ownership_type)) + 
  geom_point() +
  labs(title = "Scatter Plot of Comp_History vs. Average Salary")

# Scatter Plot Log- log
ggplot(data, aes(x=log(Comp_history), y = log(AvgSalary), color=Ownership_type)) + 
  geom_point() +
  labs(title = "Scatter Plot of Comp_History vs. Average Salary")

## Ml_skills vs. Average Salary by job titles 
ggplot(data, aes(x=ML_skills, y =AvgSalary, color= Job_title)) + geom_point()

We noticed that the log transformation improved the scatter plot. We moved forward with the log transformation with our initial model

Checking EDA of all varibles

ggpairs(data[,c(1,2,3,5,6,7)], mapping=ggplot2::aes())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(data[,c(9,10,11,12,13,7)], mapping=ggplot2::aes())

ggpairs(data[,c(14,15,16,17,18,19,7)], mapping=ggplot2::aes())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(data[,c(20,21,22,23,7)], mapping=ggplot2::aes())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Checking assumptions with the original data

#Fit the model

fit = lm(AvgSalary~., data = data)
summary(fit)
## 
## Call:
## lm(formula = AvgSalary ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.571 -12.905  -1.037  10.602 108.462 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   44.73837   32.11109   1.393
## Rating                                         1.73824    1.87625   0.926
## Size10000+                                    -0.53898    8.60715  -0.063
## Size1001 - 5000                                4.05616    7.24471   0.560
## Size201 - 500                                  0.25156    6.92637   0.036
## Size5001 - 10000                              -5.36205    8.51146  -0.630
## Size501 - 1000                                -1.74591    6.97613  -0.250
## Size51 - 200                                   5.63115    6.66380   0.845
## Sizeunknown                                   30.11390   16.15176   1.864
## Ownership_typeCompany - Private              -15.11921   14.26265  -1.060
## Ownership_typeCompany - Public               -12.50770   14.71616  -0.850
## Ownership_typeGovernment                     -18.67891   21.07024  -0.887
## Ownership_typeHospital                       -41.76636   17.51145  -2.385
## Ownership_typeNonprofit Organization         -28.95782   14.64491  -1.977
## Ownership_typeOther Organization             -18.32663   18.06747  -1.014
## Ownership_typeSchool / School District       -24.69313   22.53815  -1.096
## Ownership_typeSubsidiary or Business Segment -10.58444   14.93809  -0.709
## SectorAccounting & Legal                     -12.61080   31.33039  -0.403
## SectorAerospace & Defense                     -5.84538   18.81672  -0.311
## SectorAgriculture & Forestry                 -14.89962   33.02559  -0.451
## SectorArts, Entertainment & Recreation       -36.69274   24.43375  -1.502
## SectorBiotech & Pharmaceuticals               -5.49146   17.39556  -0.316
## SectorBusiness Services                      -11.34504   18.01210  -0.630
## SectorConstruction, Repair & Maintenance     -15.38139   24.23122  -0.635
## SectorConsumer Services                      -15.35573   22.38271  -0.686
## SectorEducation                              -24.82890   19.69335  -1.261
## SectorFinance                                 -8.92907   18.12152  -0.493
## SectorGovernment                             -16.77857   19.69248  -0.852
## SectorHealth Care                             -0.53742   17.95268  -0.030
## SectorInformation Technology                  -1.38739   17.54772  -0.079
## SectorInsurance                               -3.90119   17.45519  -0.223
## SectorManufacturing                          -20.11386   18.36815  -1.095
## SectorMedia                                  -11.30796   21.02874  -0.538
## SectorMining & Metals                        -32.28058   24.01207  -1.344
## SectorNon-Profit                              -9.81162   21.55776  -0.455
## SectorOil, Gas, Energy & Utilities           -29.88576   23.92769  -1.249
## SectorReal Estate                            -19.70168   20.72451  -0.951
## SectorRetail                                  -6.01066   18.57948  -0.324
## SectorTelecommunications                     -16.04517   21.25466  -0.755
## SectorTransportation & Logistics             -17.16275   20.25440  -0.847
## SectorTravel & Tourism                       -18.67992   21.10833  -0.885
## Revenue$1 to $5 million (USD)                -20.26773   11.92480  -1.700
## Revenue$10 to $25 million (USD)              -11.37047    7.35221  -1.547
## Revenue$10+ billion (USD)                     10.20060    6.73155   1.515
## Revenue$100 to $500 million (USD)              0.03939    5.69592   0.007
## Revenue$2 to $5 billion (USD)                 -4.33756    6.53917  -0.663
## Revenue$25 to $50 million (USD)               -2.03401    7.60006  -0.268
## Revenue$5 to $10 billion (USD)                -3.84303    8.41960  -0.456
## Revenue$5 to $10 million (USD)                28.16622   10.12453   2.782
## Revenue$50 to $100 million (USD)               6.18520    6.68587   0.925
## Revenue$500 million to $1 billion (USD)       -3.24564    6.62245  -0.490
## RevenueLess than $1 million (USD)             -4.19414   17.35186  -0.242
## RevenueUnknown / Non-Applicable                0.63990    5.29845   0.121
## Hourly                                       -13.29510    9.32546  -1.426
## Job.LocationAZ                                 2.86966   13.82039   0.208
## Job.LocationCA                                47.15945   10.82261   4.357
## Job.LocationCO                                31.78393   13.83040   2.298
## Job.LocationCT                                 0.61671   17.09251   0.036
## Job.LocationDC                                37.01444   13.57808   2.726
## Job.LocationDE                                16.79117   18.19177   0.923
## Job.LocationFL                                 9.98537   12.29349   0.812
## Job.LocationGA                                -3.39986   14.83153  -0.229
## Job.LocationIA                                20.95790   16.11890   1.300
## Job.LocationID                                17.24120   21.09292   0.817
## Job.LocationIL                                37.28471   11.75181   3.173
## Job.LocationIN                                 6.68039   13.87555   0.481
## Job.LocationKS                                -7.44824   21.26598  -0.350
## Job.LocationKY                                50.35207   16.33637   3.082
## Job.LocationLA                                 9.08656   17.28016   0.526
## Job.LocationMA                                22.13743   11.29726   1.960
## Job.LocationMD                                22.34487   11.03884   2.024
## Job.LocationMI                                37.64724   15.75463   2.390
## Job.LocationMN                                39.84411   21.05357   1.893
## Job.LocationMO                                34.72739   14.33060   2.423
## Job.LocationNC                                21.80533   12.71251   1.715
## Job.LocationNE                                10.35477   17.75121   0.583
## Job.LocationNJ                                40.66820   12.64037   3.217
## Job.LocationNM                               -16.19410   18.35588  -0.882
## Job.LocationNY                                28.90479   11.20980   2.579
## Job.LocationOH                                25.25122   12.97189   1.947
## Job.LocationOR                                18.65677   17.06833   1.093
## Job.LocationPA                                20.02570   12.35532   1.621
## Job.LocationRI                                41.11211   28.75606   1.430
## Job.LocationSC                                 8.63612   27.28213   0.317
## Job.LocationTN                                 3.62162   13.25275   0.273
## Job.LocationTX                                22.28932   12.00974   1.856
## Job.LocationUT                                44.91421   16.37831   2.742
## Job.LocationVA                                19.77194   11.12446   1.777
## Job.LocationWA                                24.18280   13.40694   1.804
## Job.LocationWI                                15.02319   14.17840   1.060
## Comp_history                                   0.05272    0.03040   1.734
## Python                                         9.25073    2.66096   3.476
## aws                                            1.25197    2.73394   0.458
## excel                                          0.29790    2.21488   0.134
## sql                                           -7.45072    2.80046  -2.661
## sas                                           13.14897    4.14333   3.174
## flink                                         -2.92467    9.30753  -0.314
## mongo                                         16.14220    5.28079   3.057
## google_an                                     -6.30308    9.75108  -0.646
## Job_titledata analitics                        0.96320   10.99728   0.088
## Job_titledata engineer                        29.02483    4.56200   6.362
## Job_titledata modeler                         19.83431   13.06796   1.518
## Job_titledata scientist                       35.82931    3.99772   8.962
## Job_titleData scientist project manager        3.77889    7.80475   0.484
## Job_titledirector                             55.81974   12.23907   4.561
## Job_titlemachine learning engineer            51.51131    7.87601   6.540
## Job_titleother data related jobs               5.61757    9.75807   0.576
## Job_titleother scientist                      16.46659    5.24085   3.142
## Experienceno experience required             -10.52320   15.78448  -0.667
## Experiencesr                                  14.00099   15.90342   0.880
## Degreena                                       3.24468    2.54486   1.275
## DegreeP                                        7.87797    3.88378   2.028
## ML_skills                                      0.41671    1.51747   0.275
## vizualizationboth                             15.44948   11.05894   1.397
## vizualizationnone                             10.82997   10.53173   1.028
## vizualizationtableau                           4.19043   10.90814   0.384
## data_frameworkhadoop                          10.01528    6.00461   1.668
## data_frameworknone                             2.85529    3.88756   0.734
## data_frameworkspark                            3.86361    4.53445   0.852
##                                              Pr(>|t|)    
## (Intercept)                                  0.164046    
## Rating                                       0.354574    
## Size10000+                                   0.950090    
## Size1001 - 5000                              0.575764    
## Size201 - 500                                0.971039    
## Size5001 - 10000                             0.528939    
## Size501 - 1000                               0.802462    
## Size51 - 200                                 0.398416    
## Sizeunknown                                  0.062731 .  
## Ownership_typeCompany - Private              0.289531    
## Ownership_typeCompany - Public               0.395691    
## Ownership_typeGovernment                     0.375687    
## Ownership_typeHospital                       0.017374 *  
## Ownership_typeNonprofit Organization         0.048445 *  
## Ownership_typeOther Organization             0.310813    
## Ownership_typeSchool / School District       0.273671    
## Ownership_typeSubsidiary or Business Segment 0.478867    
## SectorAccounting & Legal                     0.687447    
## SectorAerospace & Defense                    0.756172    
## SectorAgriculture & Forestry                 0.652036    
## SectorArts, Entertainment & Recreation       0.133676    
## SectorBiotech & Pharmaceuticals              0.752350    
## SectorBusiness Services                      0.529020    
## SectorConstruction, Repair & Maintenance     0.525808    
## SectorConsumer Services                      0.492935    
## SectorEducation                              0.207862    
## SectorFinance                                0.622375    
## SectorGovernment                             0.394526    
## SectorHealth Care                            0.976128    
## SectorInformation Technology                 0.937007    
## SectorInsurance                              0.823222    
## SectorManufacturing                          0.273923    
## SectorMedia                                  0.590950    
## SectorMining & Metals                        0.179326    
## SectorNon-Profit                             0.649173    
## SectorOil, Gas, Energy & Utilities           0.212134    
## SectorReal Estate                            0.342153    
## SectorRetail                                 0.746417    
## SectorTelecommunications                     0.450594    
## SectorTransportation & Logistics             0.397121    
## SectorTravel & Tourism                       0.376523    
## Revenue$1 to $5 million (USD)                0.089701 .  
## Revenue$10 to $25 million (USD)              0.122483    
## Revenue$10+ billion (USD)                    0.130194    
## Revenue$100 to $500 million (USD)            0.994485    
## Revenue$2 to $5 billion (USD)                0.507371    
## Revenue$25 to $50 million (USD)              0.789072    
## Revenue$5 to $10 billion (USD)               0.648234    
## Revenue$5 to $10 million (USD)               0.005567 ** 
## Revenue$50 to $100 million (USD)             0.355265    
## Revenue$500 million to $1 billion (USD)      0.624239    
## RevenueLess than $1 million (USD)            0.809083    
## RevenueUnknown / Non-Applicable              0.903911    
## Hourly                                       0.154463    
## Job.LocationAZ                               0.835578    
## Job.LocationCA                               1.54e-05 ***
## Job.LocationCO                               0.021886 *  
## Job.LocationCT                               0.971229    
## Job.LocationDC                               0.006590 ** 
## Job.LocationDE                               0.356361    
## Job.LocationFL                               0.416960    
## Job.LocationGA                               0.818764    
## Job.LocationIA                               0.194012    
## Job.LocationID                               0.414017    
## Job.LocationIL                               0.001585 ** 
## Job.LocationIN                               0.630366    
## Job.LocationKS                               0.726276    
## Job.LocationKY                               0.002146 ** 
## Job.LocationLA                               0.599188    
## Job.LocationMA                               0.050496 .  
## Job.LocationMD                               0.043376 *  
## Job.LocationMI                               0.017164 *  
## Job.LocationMN                               0.058887 .  
## Job.LocationMO                               0.015664 *  
## Job.LocationNC                               0.086794 .  
## Job.LocationNE                               0.559884    
## Job.LocationNJ                               0.001361 ** 
## Job.LocationNM                               0.377994    
## Job.LocationNY                               0.010151 *  
## Job.LocationOH                               0.052031 .  
## Job.LocationOR                               0.274789    
## Job.LocationPA                               0.105564    
## Job.LocationRI                               0.153309    
## Job.LocationSC                               0.751692    
## Job.LocationTN                               0.784734    
## Job.LocationTX                               0.063935 .  
## Job.LocationUT                               0.006277 ** 
## Job.LocationVA                               0.076001 .  
## Job.LocationWA                               0.071754 .  
## Job.LocationWI                               0.289746    
## Comp_history                                 0.083402 .  
## Python                                       0.000543 ***
## aws                                          0.647159    
## excel                                        0.893052    
## sql                                          0.008003 ** 
## sas                                          0.001580 ** 
## flink                                        0.753454    
## mongo                                        0.002333 ** 
## google_an                                    0.518260    
## Job_titledata analitics                      0.930234    
## Job_titledata engineer                       3.86e-10 ***
## Job_titledata modeler                        0.129578    
## Job_titledata scientist                       < 2e-16 ***
## Job_titleData scientist project manager      0.628430    
## Job_titledirector                            6.14e-06 ***
## Job_titlemachine learning engineer           1.28e-10 ***
## Job_titleother data related jobs             0.565037    
## Job_titleother scientist                     0.001758 ** 
## Experienceno experience required             0.505224    
## Experiencesr                                 0.378996    
## Degreena                                     0.202787    
## DegreeP                                      0.042942 *  
## ML_skills                                    0.783710    
## vizualizationboth                            0.162908    
## vizualizationnone                            0.304200    
## vizualizationtableau                         0.700994    
## data_frameworkhadoop                         0.095833 .  
## data_frameworknone                           0.462940    
## data_frameworkspark                          0.394511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.57 on 622 degrees of freedom
## Multiple R-squared:  0.6366, Adjusted R-squared:  0.5677 
## F-statistic: 9.236 on 118 and 622 DF,  p-value: < 2.2e-16
# QQ plot of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))

# Diagnostic plots
par(mfrow = c(2, 3))

# Residuals vs Fitted Values
plot(fit, which = 1)

# Scale-Location (also called Spread-Location)
plot(fit, which = 3)
## Warning: not plotting observations with leverage one:
##   290, 449
# Cook's distance plot
plot(fit, which = 4)

# Residuals vs Leverage
plot(fit, which = 5)
## Warning: not plotting observations with leverage one:
##   290, 449
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
# Histogram of residuals
hist(resid(fit))

Independence: We assume that each job posting on Glassdor is independent.

Linearity: Even though the linearity between the salary an the predictors individual not very strong (0.324 as the highest correlation), we assume that there is a linearity between the average salary with all of the predictors remained in the model.

Normality: is a little right skewed but not a big concern here.

Homoscedasticity: We can see that the residuals do not form a random clouds in the residual vs. fitted plot. There are evidence of the non constant variances. Therefore, we will continue with the log transformation of the average salary.

Model1: Checking assumptions with logged average salary

#Fit the model 
data$log_Comp_history <-log(data$Comp_history)
logfit = lm(log(AvgSalary)~. -Comp_history, data = data)
summary(logfit)
## 
## Call:
## lm(formula = log(AvgSalary) ~ . - Comp_history, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70870 -0.12663 -0.00022  0.13511  0.68143 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   3.983480   0.303570  13.122
## Rating                                        0.031741   0.017153   1.850
## Size10000+                                   -0.070397   0.078673  -0.895
## Size1001 - 5000                              -0.005350   0.066295  -0.081
## Size201 - 500                                -0.018578   0.063368  -0.293
## Size5001 - 10000                             -0.080856   0.077686  -1.041
## Size501 - 1000                               -0.067763   0.063739  -1.063
## Size51 - 200                                  0.022459   0.061006   0.368
## Sizeunknown                                   0.328132   0.148097   2.216
## Ownership_typeCompany - Private              -0.252642   0.131799  -1.917
## Ownership_typeCompany - Public               -0.195259   0.135518  -1.441
## Ownership_typeGovernment                     -0.217954   0.193397  -1.127
## Ownership_typeHospital                       -0.439855   0.160922  -2.733
## Ownership_typeNonprofit Organization         -0.386823   0.134372  -2.879
## Ownership_typeOther Organization             -0.307498   0.166020  -1.852
## Ownership_typeSchool / School District       -0.292904   0.206609  -1.418
## Ownership_typeSubsidiary or Business Segment -0.233690   0.137505  -1.700
## SectorAccounting & Legal                     -0.124767   0.286391  -0.436
## SectorAerospace & Defense                    -0.089084   0.172109  -0.518
## SectorAgriculture & Forestry                 -0.101400   0.302285  -0.335
## SectorArts, Entertainment & Recreation       -0.488445   0.223348  -2.187
## SectorBiotech & Pharmaceuticals              -0.073336   0.159074  -0.461
## SectorBusiness Services                      -0.158007   0.164640  -0.960
## SectorConstruction, Repair & Maintenance     -0.103899   0.221951  -0.468
## SectorConsumer Services                      -0.152161   0.204661  -0.743
## SectorEducation                              -0.296657   0.180198  -1.646
## SectorFinance                                -0.156146   0.165662  -0.943
## SectorGovernment                             -0.184666   0.180021  -1.026
## SectorHealth Care                            -0.029150   0.164200  -0.178
## SectorInformation Technology                 -0.084921   0.160509  -0.529
## SectorInsurance                              -0.090207   0.159545  -0.565
## SectorManufacturing                          -0.208316   0.167914  -1.241
## SectorMedia                                  -0.144420   0.192204  -0.751
## SectorMining & Metals                        -0.282324   0.218630  -1.291
## SectorNon-Profit                             -0.132115   0.197121  -0.670
## SectorOil, Gas, Energy & Utilities           -0.335708   0.218696  -1.535
## SectorReal Estate                            -0.206678   0.189530  -1.090
## SectorRetail                                 -0.055485   0.169846  -0.327
## SectorTelecommunications                     -0.265296   0.193986  -1.368
## SectorTransportation & Logistics             -0.166545   0.185119  -0.900
## SectorTravel & Tourism                       -0.202046   0.192967  -1.047
## Revenue$1 to $5 million (USD)                -0.204127   0.109236  -1.869
## Revenue$10 to $25 million (USD)              -0.139671   0.067201  -2.078
## Revenue$10+ billion (USD)                     0.113521   0.060257   1.884
## Revenue$100 to $500 million (USD)            -0.011986   0.052135  -0.230
## Revenue$2 to $5 billion (USD)                -0.072780   0.060307  -1.207
## Revenue$25 to $50 million (USD)              -0.060709   0.069473  -0.874
## Revenue$5 to $10 billion (USD)               -0.052888   0.076841  -0.688
## Revenue$5 to $10 million (USD)                0.225070   0.092541   2.432
## Revenue$50 to $100 million (USD)              0.024643   0.061147   0.403
## Revenue$500 million to $1 billion (USD)      -0.052393   0.060660  -0.864
## RevenueLess than $1 million (USD)            -0.060789   0.158634  -0.383
## RevenueUnknown / Non-Applicable              -0.015636   0.048875  -0.320
## Hourly                                       -0.352789   0.085077  -4.147
## Job.LocationAZ                                0.088787   0.126397   0.702
## Job.LocationCA                                0.491711   0.098913   4.971
## Job.LocationCO                                0.393735   0.126573   3.111
## Job.LocationCT                                0.085571   0.156168   0.548
## Job.LocationDC                                0.364760   0.124101   2.939
## Job.LocationDE                                0.382696   0.166284   2.301
## Job.LocationFL                                0.127096   0.112536   1.129
## Job.LocationGA                                0.049514   0.135688   0.365
## Job.LocationIA                                0.187296   0.147358   1.271
## Job.LocationID                                0.206046   0.192813   1.069
## Job.LocationIL                                0.371819   0.107428   3.461
## Job.LocationIN                                0.131656   0.127039   1.036
## Job.LocationKS                                0.068822   0.194333   0.354
## Job.LocationKY                                0.587230   0.149323   3.933
## Job.LocationLA                                0.102116   0.158319   0.645
## Job.LocationMA                                0.291315   0.103225   2.822
## Job.LocationMD                                0.336930   0.101019   3.335
## Job.LocationMI                                0.469252   0.144102   3.256
## Job.LocationMN                                0.466280   0.192446   2.423
## Job.LocationMO                                0.356299   0.131078   2.718
## Job.LocationNC                                0.297738   0.116273   2.561
## Job.LocationNE                                0.160604   0.162177   0.990
## Job.LocationNJ                                0.482334   0.115494   4.176
## Job.LocationNM                               -0.091422   0.167682  -0.545
## Job.LocationNY                                0.345844   0.102488   3.374
## Job.LocationOH                                0.330631   0.118684   2.786
## Job.LocationOR                                0.277846   0.156253   1.778
## Job.LocationPA                                0.265526   0.113046   2.349
## Job.LocationRI                                0.679800   0.263015   2.585
## Job.LocationSC                                0.129786   0.249392   0.520
## Job.LocationTN                                0.072963   0.121036   0.603
## Job.LocationTX                                0.273732   0.109791   2.493
## Job.LocationUT                                0.460476   0.149605   3.078
## Job.LocationVA                                0.278813   0.101778   2.739
## Job.LocationWA                                0.280368   0.122625   2.286
## Job.LocationWI                                0.196728   0.129778   1.516
## Python                                        0.079923   0.024322   3.286
## aws                                           0.004677   0.025019   0.187
## excel                                         0.010542   0.020224   0.521
## sql                                          -0.043798   0.025591  -1.711
## sas                                           0.102320   0.037617   2.720
## flink                                         0.011613   0.085075   0.136
## mongo                                         0.170830   0.048296   3.537
## google_an                                    -0.071536   0.089185  -0.802
## Job_titledata analitics                       0.045318   0.100507   0.451
## Job_titledata engineer                        0.384302   0.041712   9.213
## Job_titledata modeler                         0.258836   0.119365   2.168
## Job_titledata scientist                       0.419277   0.036592  11.458
## Job_titleData scientist project manager       0.065028   0.071412   0.911
## Job_titledirector                             0.611658   0.111858   5.468
## Job_titlemachine learning engineer            0.554015   0.072025   7.692
## Job_titleother data related jobs              0.140856   0.089151   1.580
## Job_titleother scientist                      0.234983   0.047982   4.897
## Experienceno experience required             -0.084432   0.144116  -0.586
## Experiencesr                                  0.141686   0.145236   0.976
## Degreena                                      0.010001   0.023121   0.433
## DegreeP                                       0.040409   0.035182   1.149
## ML_skills                                     0.014921   0.013871   1.076
## vizualizationboth                             0.245430   0.101218   2.425
## vizualizationnone                             0.170059   0.096223   1.767
## vizualizationtableau                          0.124925   0.099689   1.253
## data_frameworkhadoop                          0.085295   0.054868   1.555
## data_frameworknone                            0.025254   0.035538   0.711
## data_frameworkspark                           0.028888   0.041589   0.695
## log_Comp_history                             -0.003586   0.014911  -0.240
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## Rating                                       0.064719 .  
## Size10000+                                   0.371236    
## Size1001 - 5000                              0.935708    
## Size201 - 500                                0.769480    
## Size5001 - 10000                             0.298370    
## Size501 - 1000                               0.288136    
## Size51 - 200                                 0.712892    
## Sizeunknown                                  0.027077 *  
## Ownership_typeCompany - Private              0.055712 .  
## Ownership_typeCompany - Public               0.150134    
## Ownership_typeGovernment                     0.260186    
## Ownership_typeHospital                       0.006448 ** 
## Ownership_typeNonprofit Organization         0.004130 ** 
## Ownership_typeOther Organization             0.064475 .  
## Ownership_typeSchool / School District       0.156787    
## Ownership_typeSubsidiary or Business Segment 0.089724 .  
## SectorAccounting & Legal                     0.663240    
## SectorAerospace & Defense                    0.604919    
## SectorAgriculture & Forestry                 0.737403    
## SectorArts, Entertainment & Recreation       0.029120 *  
## SectorBiotech & Pharmaceuticals              0.644946    
## SectorBusiness Services                      0.337574    
## SectorConstruction, Repair & Maintenance     0.639863    
## SectorConsumer Services                      0.457474    
## SectorEducation                              0.100210    
## SectorFinance                                0.346271    
## SectorGovernment                             0.305383    
## SectorHealth Care                            0.859154    
## SectorInformation Technology                 0.596942    
## SectorInsurance                              0.572006    
## SectorManufacturing                          0.215219    
## SectorMedia                                  0.452703    
## SectorMining & Metals                        0.197067    
## SectorNon-Profit                             0.502964    
## SectorOil, Gas, Energy & Utilities           0.125281    
## SectorReal Estate                            0.275927    
## SectorRetail                                 0.744023    
## SectorTelecommunications                     0.171929    
## SectorTransportation & Logistics             0.368648    
## SectorTravel & Tourism                       0.295482    
## Revenue$1 to $5 million (USD)                0.062136 .  
## Revenue$10 to $25 million (USD)              0.038083 *  
## Revenue$10+ billion (USD)                    0.060039 .  
## Revenue$100 to $500 million (USD)            0.818238    
## Revenue$2 to $5 billion (USD)                0.227958    
## Revenue$25 to $50 million (USD)              0.382532    
## Revenue$5 to $10 billion (USD)               0.491531    
## Revenue$5 to $10 million (USD)               0.015292 *  
## Revenue$50 to $100 million (USD)             0.687082    
## Revenue$500 million to $1 billion (USD)      0.388074    
## RevenueLess than $1 million (USD)            0.701701    
## RevenueUnknown / Non-Applicable              0.749135    
## Hourly                                       3.84e-05 ***
## Job.LocationAZ                               0.482668    
## Job.LocationCA                               8.62e-07 ***
## Job.LocationCO                               0.001952 ** 
## Job.LocationCT                               0.583926    
## Job.LocationDC                               0.003413 ** 
## Job.LocationDE                               0.021695 *  
## Job.LocationFL                               0.259171    
## Job.LocationGA                               0.715304    
## Job.LocationIA                               0.204193    
## Job.LocationID                               0.285649    
## Job.LocationIL                               0.000575 ***
## Job.LocationIN                               0.300442    
## Job.LocationKS                               0.723352    
## Job.LocationKY                               9.35e-05 ***
## Job.LocationLA                               0.519165    
## Job.LocationMA                               0.004923 ** 
## Job.LocationMD                               0.000903 ***
## Job.LocationMI                               0.001190 ** 
## Job.LocationMN                               0.015681 *  
## Job.LocationMO                               0.006747 ** 
## Job.LocationNC                               0.010682 *  
## Job.LocationNE                               0.322412    
## Job.LocationNJ                               3.39e-05 ***
## Job.LocationNM                               0.585805    
## Job.LocationNY                               0.000786 ***
## Job.LocationOH                               0.005502 ** 
## Job.LocationOR                               0.075862 .  
## Job.LocationPA                               0.019145 *  
## Job.LocationRI                               0.009975 ** 
## Job.LocationSC                               0.602963    
## Job.LocationTN                               0.546850    
## Job.LocationTX                               0.012918 *  
## Job.LocationUT                               0.002176 ** 
## Job.LocationVA                               0.006331 ** 
## Job.LocationWA                               0.022566 *  
## Job.LocationWI                               0.130056    
## Python                                       0.001073 ** 
## aws                                          0.851773    
## excel                                        0.602373    
## sql                                          0.087497 .  
## sas                                          0.006709 ** 
## flink                                        0.891473    
## mongo                                        0.000435 ***
## google_an                                    0.422799    
## Job_titledata analitics                      0.652223    
## Job_titledata engineer                        < 2e-16 ***
## Job_titledata modeler                        0.030503 *  
## Job_titledata scientist                       < 2e-16 ***
## Job_titleData scientist project manager      0.362858    
## Job_titledirector                            6.59e-08 ***
## Job_titlemachine learning engineer           5.70e-14 ***
## Job_titleother data related jobs             0.114624    
## Job_titleother scientist                     1.24e-06 ***
## Experienceno experience required             0.558180    
## Experiencesr                                 0.329664    
## Degreena                                     0.665484    
## DegreeP                                      0.251170    
## ML_skills                                    0.282502    
## vizualizationboth                            0.015602 *  
## vizualizationnone                            0.077661 .  
## vizualizationtableau                         0.210625    
## data_frameworkhadoop                         0.120558    
## data_frameworknone                           0.477579    
## data_frameworkspark                          0.487561    
## log_Comp_history                             0.810034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2246 on 622 degrees of freedom
## Multiple R-squared:  0.6943, Adjusted R-squared:  0.6363 
## F-statistic: 11.97 on 118 and 622 DF,  p-value: < 2.2e-16
confint(logfit)
##                                                     2.5 %       97.5 %
## (Intercept)                                   3.387333290  4.579627381
## Rating                                       -0.001943635  0.065426517
## Size10000+                                   -0.224894194  0.084099609
## Size1001 - 5000                              -0.135539716  0.124839950
## Size201 - 500                                -0.143018889  0.105862283
## Size5001 - 10000                             -0.233413320  0.071702168
## Size501 - 1000                               -0.192931830  0.057406757
## Size51 - 200                                 -0.097344296  0.142262308
## Sizeunknown                                   0.037300111  0.618963372
## Ownership_typeCompany - Private              -0.511466521  0.006182750
## Ownership_typeCompany - Public               -0.461386596  0.070868774
## Ownership_typeGovernment                     -0.597744742  0.161835848
## Ownership_typeHospital                       -0.755872030 -0.123837734
## Ownership_typeNonprofit Organization         -0.650701585 -0.122944941
## Ownership_typeOther Organization             -0.633525544  0.018530363
## Ownership_typeSchool / School District       -0.698638789  0.112831420
## Ownership_typeSubsidiary or Business Segment -0.503719398  0.036339960
## SectorAccounting & Legal                     -0.687176272  0.437642833
## SectorAerospace & Defense                    -0.427068574  0.248900279
## SectorAgriculture & Forestry                 -0.695022443  0.492222771
## SectorArts, Entertainment & Recreation       -0.927051366 -0.049837929
## SectorBiotech & Pharmaceuticals              -0.385723169  0.239051078
## SectorBusiness Services                      -0.481325275  0.165311100
## SectorConstruction, Repair & Maintenance     -0.539762932  0.331964159
## SectorConsumer Services                      -0.554070817  0.249749727
## SectorEducation                              -0.650526955  0.057212868
## SectorFinance                                -0.481469825  0.169177467
## SectorGovernment                             -0.538187334  0.168856151
## SectorHealth Care                            -0.351602099  0.293303010
## SectorInformation Technology                 -0.400125702  0.230283296
## SectorInsurance                              -0.403519637  0.223106126
## SectorManufacturing                          -0.538063564  0.121432320
## SectorMedia                                  -0.521866647  0.233027386
## SectorMining & Metals                        -0.711665877  0.147017574
## SectorNon-Profit                             -0.519218902  0.254988514
## SectorOil, Gas, Energy & Utilities           -0.765179416  0.093763159
## SectorReal Estate                            -0.578874155  0.165518854
## SectorRetail                                 -0.389026764  0.278057283
## SectorTelecommunications                     -0.646243026  0.115650135
## SectorTransportation & Logistics             -0.530079530  0.196989548
## SectorTravel & Tourism                       -0.580991167  0.176898607
## Revenue$1 to $5 million (USD)                -0.418642823  0.010388132
## Revenue$10 to $25 million (USD)              -0.271639869 -0.007701451
## Revenue$10+ billion (USD)                    -0.004811132  0.231853428
## Revenue$100 to $500 million (USD)            -0.114367565  0.090395114
## Revenue$2 to $5 billion (USD)                -0.191209533  0.045649889
## Revenue$25 to $50 million (USD)              -0.197138954  0.075720035
## Revenue$5 to $10 billion (USD)               -0.203786519  0.098010274
## Revenue$5 to $10 million (USD)                0.043339003  0.406800326
## Revenue$50 to $100 million (USD)             -0.095437495  0.144723066
## Revenue$500 million to $1 billion (USD)      -0.171515499  0.066729438
## RevenueLess than $1 million (USD)            -0.372312225  0.250734566
## RevenueUnknown / Non-Applicable              -0.111617247  0.080344605
## Hourly                                       -0.519862271 -0.185716042
## Job.LocationAZ                               -0.159430900  0.337004187
## Job.LocationCA                                0.297466465  0.685954747
## Job.LocationCO                                0.145173109  0.642295969
## Job.LocationCT                               -0.221108865  0.392251664
## Job.LocationDC                                0.121051541  0.608468558
## Job.LocationDE                                0.056150929  0.709241232
## Job.LocationFL                               -0.093899860  0.348091719
## Job.LocationGA                               -0.216948708  0.315976087
## Job.LocationIA                               -0.102083053  0.476675703
## Job.LocationID                               -0.172596272  0.584689263
## Job.LocationIL                                0.160853588  0.582783662
## Job.LocationIN                               -0.117820329  0.381133055
## Job.LocationKS                               -0.312806855  0.450450007
## Job.LocationKY                                0.293990638  0.880468773
## Job.LocationLA                               -0.208789162  0.413021221
## Job.LocationMA                                0.088604729  0.494026232
## Job.LocationMD                                0.138550190  0.535310404
## Job.LocationMI                                0.186265882  0.752238571
## Job.LocationMN                                0.088356688  0.844203632
## Job.LocationMO                                0.098888895  0.613708888
## Job.LocationNC                                0.069402644  0.526072361
## Job.LocationNE                               -0.157876737  0.479085423
## Job.LocationNJ                                0.255529921  0.709138957
## Job.LocationNM                               -0.420713428  0.237869703
## Job.LocationNY                                0.144579393  0.547107772
## Job.LocationOH                                0.097560238  0.563701869
## Job.LocationOR                               -0.029001032  0.584692656
## Job.LocationPA                                0.043527700  0.487524619
## Job.LocationRI                                0.163295343  1.196305391
## Job.LocationSC                               -0.359966179  0.619538484
## Job.LocationTN                               -0.164726664  0.310652568
## Job.LocationTX                                0.058126754  0.489338002
## Job.LocationUT                                0.166683807  0.754269030
## Job.LocationVA                                0.078942924  0.478683251
## Job.LocationWA                                0.039560008  0.521176860
## Job.LocationWI                               -0.058127033  0.451583939
## Python                                        0.032160029  0.127685093
## aws                                          -0.044454573  0.053808306
## excel                                        -0.029173249  0.050256771
## sql                                          -0.094053606  0.006457681
## sas                                           0.028449007  0.176191438
## flink                                        -0.155457059  0.178682092
## mongo                                         0.075987969  0.265672516
## google_an                                    -0.246676336  0.103604642
## Job_titledata analitics                      -0.152056715  0.242692906
## Job_titledata engineer                        0.302389520  0.466214867
## Job_titledata modeler                         0.024429655  0.493243255
## Job_titledata scientist                       0.347417429  0.491136954
## Job_titleData scientist project manager      -0.075209883  0.205265737
## Job_titledirector                             0.391992927  0.831323929
## Job_titlemachine learning engineer            0.412573117  0.695457597
## Job_titleother data related jobs             -0.034218488  0.315930187
## Job_titleother scientist                      0.140757166  0.329209525
## Experienceno experience required             -0.367443966  0.198579970
## Experiencesr                                 -0.143526580  0.426898155
## Degreena                                     -0.035403405  0.055405854
## DegreeP                                      -0.028680267  0.109498616
## ML_skills                                    -0.012319693  0.042161085
## vizualizationboth                             0.046659162  0.444200625
## vizualizationnone                            -0.018902274  0.359020004
## vizualizationtableau                         -0.070843327  0.320692916
## data_frameworkhadoop                         -0.022452801  0.193043286
## data_frameworknone                           -0.044534265  0.095042783
## data_frameworkspark                          -0.052784407  0.110560963
## log_Comp_history                             -0.032867637  0.025695951
# QQ plot of the residuals
qqnorm(residuals(logfit))
qqline(residuals(logfit))

# Diagnostic plots
par(mfrow = c(2, 3))

# Residuals vs Fitted Values
plot(logfit, which = 1)

# Scale-Location (also called Spread-Location)
plot(logfit, which = 3)
## Warning: not plotting observations with leverage one:
##   290, 449
# Cook's distance plot
plot(logfit, which = 4)

# Residuals vs Leverage
plot(logfit, which = 5)
## Warning: not plotting observations with leverage one:
##   290, 449
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
# Histogram of residuals
hist(resid(logfit))

Independence: We assume that each job posting on Glassdor is independent. The log transformation does not affect the independence between the data points.

Linearity: Even though the linearity between the salary an the predictors individual not very strong (0.324 as the highest correlation), we assume that there is a linearity between the average salary with all of the predictors remained in the model.

Normality: The distribution of the residuals is less skewed. We assume that the normality is met.

Homoscedasticity: The log transformation also improved the inconstant variance issue. The residuals in the residual vs. fitted value plot appear to form a random cloud around 0.

There are two observations 290 and 449 appeared to have leverage higher than 1. After checking the data set to see if there is any error in the in the data collecting process, we did not see any reasons or explanations why these two observations were outliers. Their average salary in close to the mean salary, their predictor values were also not in the extreme sides. We also fitted the models without the outliers and see no much of a difference in our p-values or the coefficients. Therefore, we decided to keep these outliers and will interpret the model with the logged salary and all of the predictors in the current data set.

Interpretaion of the Model 1:

The reference for the interpretation of the coefficients: a full-time junior analyst job from a college/ university organization in Alabama. We don’t know the revenue, the sector but we know the company size is about 1-50 employees. The requirement for this job is master degree and have to know both bi and tableau.

There is no enough significant evidence to suggest that the company rating or the company age is associated with the change in the median salary, keeping the other predictors constant (p-value = 0.064 and 0.81 respectively).

However, there are overwhelming evidence to suggest that some locations are associated with the change in the median salary. For example, for same job as our reference holding all of the other variables but if the job is in California, the median salary would have a 63% increase (p-value. <0.0001). A 95% confidence interval for this increase is from 35%-99% in the median salary.

Similarly, there is evidence that knowing Python is an advantage. For the sam job with everything else held constant as the reference, additional requirement of Python is associated with 8% increase in the median salary (p-value = 0.001). A 95% confidence interval of this increase is from 3% to 14%.

Also, as we all wanted to know if the data scientist is a good investment. There is strong evidence to suggest that the data scientist job title is associated with the higher pay than an analyst job even if other variables are unchanged from the reference (p-value <0.0001). It is estimated that the median salary of data scientist job title is 52% higher than median salary of a analyst job, holding the other predictors constant. A 95% confidence interval for this increase is from 41% to 63%.

Model1 - VIFs

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif_values <- vif(logfit)
# Display the VIF values
print(vif_values)
##                          GVIF Df GVIF^(1/(2*Df))
## Rating           2.773333e+00  1        1.665333
## Size             7.176802e+02  7        1.599540
## Ownership_type   9.851103e+02  8        1.538483
## Sector           3.040314e+06 24        1.364774
## Revenue          3.490705e+03 12        1.404822
## Hourly           3.331298e+00  1        1.825184
## Job.Location     2.277312e+05 36        1.186885
## Python           2.164492e+00  1        1.471221
## aws              1.664754e+00  1        1.290253
## excel            1.498627e+00  1        1.224184
## sql              2.402871e+00  1        1.550120
## sas              1.686041e+00  1        1.298476
## flink            1.415083e+00  1        1.189573
## mongo            1.624981e+00  1        1.274747
## google_an        2.165237e+00  1        1.471474
## Job_title        9.586276e+01  9        1.288521
## Experience       1.715823e+00  2        1.144506
## Degree           3.224348e+00  2        1.340017
## ML_skills        1.690143e+00  1        1.300055
## vizualization    4.322933e+00  3        1.276330
## data_framework   4.582293e+00  3        1.288785
## log_Comp_history 3.165049e+00  1        1.779058

Model 2: Foward Feature Selection

library(leaps)
set.seed(123)
reg.fwd=regsubsets(log(AvgSalary)~.,data=data,method="forward",nvmax=40)
summary(reg.fwd)$adjr2
##  [1] 0.1692560 0.2756362 0.3344676 0.3843202 0.4270107 0.4610551 0.4845899
##  [8] 0.5157771 0.5329167 0.5433114 0.5531788 0.5630019 0.5714122 0.5785861
## [15] 0.5849125 0.5899307 0.5944495 0.5988805 0.6019873 0.6053165 0.6081596
## [22] 0.6106036 0.6126111 0.6148856 0.6166975 0.6184557 0.6200230 0.6214625
## [29] 0.6228650 0.6242618 0.6257571 0.6279423 0.6297105 0.6311011 0.6323866
## [36] 0.6332987 0.6343446 0.6353733 0.6361483 0.6369464
summary(reg.fwd)$rss
##  [1] 85.18064 74.17240 68.05592 62.87268 58.43366 54.88703 52.41869 49.17967
##  [9] 47.37410 46.25645 45.19501 44.14079 43.23182 42.44971 41.75485 41.19316
## [17] 40.68295 40.18280 39.81635 39.42854 39.09015 38.79231 38.53857 38.25886
## [25] 38.02567 37.79831 37.59032 37.39539 37.20452 37.01459 36.81536 36.54878
## [33] 36.32369 36.13610 35.95917 35.81908 35.66618 35.51524 35.38928 35.26128
summary(reg.fwd)$bic
##  [1] -125.1923 -221.1254 -278.2896 -330.3822 -378.0300 -417.8197 -445.3079
##  [8] -485.9631 -507.0719 -518.1552 -528.7488 -539.6302 -548.4406 -555.3608
## [15] -560.9826 -564.4103 -567.0374 -569.5956 -569.7762 -570.4209 -570.1999
## [22] -569.2595 -567.5143 -566.3040 -564.2262 -562.0620 -559.5427 -556.7872
## [29] -553.9711 -551.1556 -548.5469 -547.3240 -545.2935 -542.5223 -539.5513
## [36] -535.8358 -532.3976 -528.9321 -524.9570 -521.0340
par(mfrow=c(1,3))
bics<-summary(reg.fwd)$bic
bics
##  [1] -125.1923 -221.1254 -278.2896 -330.3822 -378.0300 -417.8197 -445.3079
##  [8] -485.9631 -507.0719 -518.1552 -528.7488 -539.6302 -548.4406 -555.3608
## [15] -560.9826 -564.4103 -567.0374 -569.5956 -569.7762 -570.4209 -570.1999
## [22] -569.2595 -567.5143 -566.3040 -564.2262 -562.0620 -559.5427 -556.7872
## [29] -553.9711 -551.1556 -548.5469 -547.3240 -545.2935 -542.5223 -539.5513
## [36] -535.8358 -532.3976 -528.9321 -524.9570 -521.0340
plot(1:40,bics,type="l",ylab="BIC",xlab="# of predictors")
index<-which(bics==min(bics))
points(index,bics[index],col="red",pch=10)

adjr2<-summary(reg.fwd)$adjr2
plot(1:40,adjr2,type="l",ylab="Adjusted R-squared",xlab="# of predictors")
index<-which(adjr2==max(adjr2))
points(index,adjr2[index],col="red",pch=10)

rss<-summary(reg.fwd)$rss
plot(1:40,rss,type="l",ylab="train RSS",xlab="# of predictors")
index<-which(rss==min(rss))
points(index,rss[index],col="red",pch=10)

coef(reg.fwd,20)
##                            (Intercept)                            Sizeunknown 
##                             3.98722404                             0.38919056 
## SectorArts, Entertainment & Recreation        SectorBiotech & Pharmaceuticals 
##                            -0.36916639                             0.08067264 
##           SectorInformation Technology              Revenue$10+ billion (USD) 
##                             0.07653221                             0.13111168 
##         Revenue$5 to $10 million (USD)                                 Hourly 
##                             0.27740529                            -0.32720170 
##                         Job.LocationCA                         Job.LocationIL 
##                             0.26350615                             0.14470722 
##                         Job.LocationMA                         Job.LocationMD 
##                             0.10443946                             0.11547814 
##                         Job.LocationNJ                         Job.LocationNY 
##                             0.25304586                             0.09161447 
##                                 Python                 Job_titledata engineer 
##                             0.07122945                             0.38634057 
##                Job_titledata scientist                      Job_titledirector 
##                             0.44838975                             0.63346997 
##     Job_titlemachine learning engineer               Job_titleother scientist 
##                             0.57868273                             0.22483849 
##                           Experiencesr 
##                             0.21739204

Notably, the forward feature selection model based on Bic has fewer predictors than the one based on the adjusted r squared or rss. Among the 20 predictors, some of them are statistical significant in the the model 2, including but not limited to Python, Job.LocationCA, and Job_titledatascientist. However, there are also predictors appeared to be statistically insignificant in our previous model but now enter the feature selection model, such as Experiencesr or SectorBiotech & Pharmaceuticals. We found this is very interesting as we know that in reality, people from senior level are paid higher than the junior level. This is evidence that we can not always trust the results right away but instead using our domain knowledge to find the appropriate predictors for our model.

We then move forward with the new subset of predictors which our group believe are practically or statistically significant to the salary. In. model 3, we will implement the glmnet technique and CV to collect the RMSE for the model comparison in objective 2.

Model 3: glmnet + no interaction term + log transformation

#glmnet - Regression Coefficients
library(caret)
set.seed(123)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=10) 
glmnet.logfit1<-train(log(AvgSalary)~Job_title + Hourly + Job.Location + Python + 
                  sas + mongo + Experience + Degree,
               data=data,
               method="glmnet",
               trControl=fitControl
               )
opt.pen1<-glmnet.logfit1$finalModel$lambdaOpt 
coef(glmnet.logfit1$finalModel,opt.pen1)
## 54 x 1 sparse Matrix of class "dgCMatrix"
##                                                    s1
## (Intercept)                              4.0155680205
## Job_titledata analitics                  0.0805688036
## Job_titledata engineer                   0.4109063120
## Job_titledata modeler                    0.2109039202
## Job_titledata scientist                  0.4822331191
## Job_titleData scientist project manager  0.0434483727
## Job_titledirector                        0.6421384942
## Job_titlemachine learning engineer       0.5707006498
## Job_titleother data related jobs         0.2376108630
## Job_titleother scientist                 0.2589587977
## Hourly                                  -0.3793169705
## Job.LocationAZ                          -0.1188641425
## Job.LocationCA                           0.2353480469
## Job.LocationCO                           0.0121288414
## Job.LocationCT                           .           
## Job.LocationDC                           0.1566006592
## Job.LocationDE                           0.1197544067
## Job.LocationFL                          -0.1254833908
## Job.LocationGA                          -0.1337735800
## Job.LocationIA                          -0.0905371999
## Job.LocationID                           .           
## Job.LocationIL                           0.1022425407
## Job.LocationIN                          -0.0878126905
## Job.LocationKS                          -0.1650966884
## Job.LocationKY                           0.1776508822
## Job.LocationLA                          -0.1405475825
## Job.LocationMA                           0.1449365844
## Job.LocationMD                           0.1198878298
## Job.LocationMI                           0.1076424850
## Job.LocationMN                           0.1522937183
## Job.LocationMO                           0.0335504368
## Job.LocationNC                           0.0177597799
## Job.LocationNE                          -0.0931505853
## Job.LocationNJ                           0.2133951884
## Job.LocationNM                          -0.2388818430
## Job.LocationNY                           0.0560818741
## Job.LocationOH                           .           
## Job.LocationOR                           .           
## Job.LocationPA                           0.0155628635
## Job.LocationRI                           0.2908423291
## Job.LocationSC                          -0.1254685601
## Job.LocationTN                          -0.1213105284
## Job.LocationTX                          -0.0001210612
## Job.LocationUT                           0.0618885752
## Job.LocationVA                          -0.0021349404
## Job.LocationWA                          -0.0694423877
## Job.LocationWI                          -0.0522838394
## Python                                   0.0717481240
## sas                                      0.0149484861
## mongo                                    0.0762741836
## Experienceno experience required         .           
## Experiencesr                             0.2367047984
## Degreena                                 .           
## DegreeP                                  0.0444660748
glmnet.logfit1
## glmnet 
## 
## 741 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 666, 667, 667, 667, 666, 667, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0003072974  0.2493264  0.5561643  0.1951310
##   0.10   0.0030729741  0.2492737  0.5553024  0.1946357
##   0.10   0.0307297408  0.2545946  0.5396723  0.1974444
##   0.55   0.0003072974  0.2491700  0.5564643  0.1949388
##   0.55   0.0030729741  0.2490330  0.5556131  0.1942840
##   0.55   0.0307297408  0.2692501  0.4981201  0.2096308
##   1.00   0.0003072974  0.2491092  0.5565036  0.1948045
##   1.00   0.0030729741  0.2491404  0.5552192  0.1944404
##   1.00   0.0307297408  0.2816638  0.4657545  0.2217544
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda = 0.003072974.
plot(glmnet.logfit1)

The best RMSE from model 3 is 0.2490 which is about 28% than the actual salary. Notice that the alpha is 0.55, and the lambda or the penalty term is 0.003. We will check for mother EDA to consider the appropriate interation term and ad more complexity to our model.

More EDA
ggplot(data=data,aes(x=Job.Location,y=log(AvgSalary),colour=Job_title))+geom_boxplot()

It is hard to say whether the job titles have different slope in this boxplot. Some states we barely have any formation, like Kentucky or South Carolina.

ggplot(data=data,aes(x=Job_title,y=log(AvgSalary),colour=Experience))+geom_boxplot()

However, we believe there is some interaction between the job titles and the experience levels. In some job, there is a significant difference in salary between the the jobs required the senior experience level and the jobs do not require a specific experience level, like data scientist and analyst. Whereas, with the jobs that have need some type of data skills, the median pay between senior experience required and no experience required are almost the same.

ggplot(data=data,aes(x=Job_title,y=log(AvgSalary),colour=Degree))+geom_boxplot()

ggplot(data=data,aes(x=Experience,y=log(AvgSalary),colour=Degree))+geom_boxplot()

Similarly, we believe that there may be an interaction between the experience level and the degree required. For the jobs that need the senior experience level, there is not much of a difference in the median salary if the jobs require different level of education. It is not the case for the jobs that do not require a specific level of seniority, as the higher degree are evidently associated with the higher pay.

ggplot(data=data,aes(x=Ownership_type,y=log(AvgSalary),colour=Degree))+geom_boxplot()

After checking a few more additional EDA, we want to add the complexity in our models for the prediction purpose.

Objective 2: Add more complexity in our previous model- Prediction Models

Based on the previous EDA and models, we will fit the model 4 with the same subset of predictors of model 3 with the additional interactions between Experience, Degree, and Job titles.

Model 4: Glmnet + Interaction + log transformation

#glmnet - Regression Coefficients
library(caret)
set.seed(123)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=10) 
glmnet.logfit2<-train(log(AvgSalary)~ Hourly + Job.Location + Python + Sector + Revenue + sas + mongo + Job_title * Experience *Degree,
               data=data,
               method="glmnet",
               trControl=fitControl
               )
opt.pen2<-glmnet.logfit2$finalModel$lambdaOpt 
coef(glmnet.logfit2$finalModel,opt.pen2)
## 166 x 1 sparse Matrix of class "dgCMatrix"
##                                                                                              s1
## (Intercept)                                                                        4.1159900843
## Hourly                                                                            -0.2652120322
## Job.LocationAZ                                                                    -0.1365085214
## Job.LocationCA                                                                     0.1999122557
## Job.LocationCO                                                                     0.0098332731
## Job.LocationCT                                                                    -0.0949371974
## Job.LocationDC                                                                     0.0894485958
## Job.LocationDE                                                                     0.1001202861
## Job.LocationFL                                                                    -0.1431111175
## Job.LocationGA                                                                    -0.1086986009
## Job.LocationIA                                                                    -0.0752283689
## Job.LocationID                                                                    -0.0421720950
## Job.LocationIL                                                                     0.0651836034
## Job.LocationIN                                                                    -0.0886974046
## Job.LocationKS                                                                    -0.0759943491
## Job.LocationKY                                                                     0.2114146064
## Job.LocationLA                                                                    -0.1730990154
## Job.LocationMA                                                                     0.0510232405
## Job.LocationMD                                                                     0.0668571219
## Job.LocationMI                                                                     0.1040172089
## Job.LocationMN                                                                     0.1498702836
## Job.LocationMO                                                                     .           
## Job.LocationNC                                                                     .           
## Job.LocationNE                                                                    -0.1372046261
## Job.LocationNJ                                                                     0.1729710066
## Job.LocationNM                                                                    -0.2469927314
## Job.LocationNY                                                                     0.0445769409
## Job.LocationOH                                                                     .           
## Job.LocationOR                                                                     .           
## Job.LocationPA                                                                    -0.0020659984
## Job.LocationRI                                                                     0.1509434295
## Job.LocationSC                                                                     .           
## Job.LocationTN                                                                    -0.1586873775
## Job.LocationTX                                                                     .           
## Job.LocationUT                                                                     0.0568406203
## Job.LocationVA                                                                     .           
## Job.LocationWA                                                                     .           
## Job.LocationWI                                                                    -0.0094422985
## Python                                                                             0.0615145510
## SectorAccounting & Legal                                                           .           
## SectorAerospace & Defense                                                          .           
## SectorAgriculture & Forestry                                                       .           
## SectorArts, Entertainment & Recreation                                            -0.4604294668
## SectorBiotech & Pharmaceuticals                                                    0.0474216772
## SectorBusiness Services                                                           -0.0207772932
## SectorConstruction, Repair & Maintenance                                           .           
## SectorConsumer Services                                                           -0.0916690034
## SectorEducation                                                                   -0.0690613632
## SectorFinance                                                                     -0.0373573528
## SectorGovernment                                                                  -0.1847200562
## SectorHealth Care                                                                 -0.0217456362
## SectorInformation Technology                                                       0.0424336322
## SectorInsurance                                                                    .           
## SectorManufacturing                                                               -0.0421184071
## SectorMedia                                                                        0.0664082740
## SectorMining & Metals                                                             -0.0108201730
## SectorNon-Profit                                                                  -0.0983351799
## SectorOil, Gas, Energy & Utilities                                                -0.1330184960
## SectorReal Estate                                                                  .           
## SectorRetail                                                                       0.0274670413
## SectorTelecommunications                                                          -0.0308000020
## SectorTransportation & Logistics                                                  -0.0354344441
## SectorTravel & Tourism                                                            -0.0979951340
## Revenue$1 to $5 million (USD)                                                     -0.0397659223
## Revenue$10 to $25 million (USD)                                                   -0.0521365560
## Revenue$10+ billion (USD)                                                          0.1165721993
## Revenue$100 to $500 million (USD)                                                  .           
## Revenue$2 to $5 billion (USD)                                                     -0.0209567634
## Revenue$25 to $50 million (USD)                                                   -0.0376156170
## Revenue$5 to $10 billion (USD)                                                    -0.0123117401
## Revenue$5 to $10 million (USD)                                                     0.2073031253
## Revenue$50 to $100 million (USD)                                                   0.0328939716
## Revenue$500 million to $1 billion (USD)                                           -0.0373297995
## RevenueLess than $1 million (USD)                                                  .           
## RevenueUnknown / Non-Applicable                                                    0.0429089675
## sas                                                                                0.0676556281
## mongo                                                                              0.0816971279
## Job_titledata analitics                                                            .           
## Job_titledata engineer                                                             0.3314302082
## Job_titledata modeler                                                              .           
## Job_titledata scientist                                                            0.3608938170
## Job_titleData scientist project manager                                            0.0054351676
## Job_titledirector                                                                  0.3672082414
## Job_titlemachine learning engineer                                                 0.5468054888
## Job_titleother data related jobs                                                   .           
## Job_titleother scientist                                                           0.1759767539
## Experienceno experience required                                                   .           
## Experiencesr                                                                       0.1439919059
## Degreena                                                                           .           
## DegreeP                                                                            0.0042125175
## Job_titledata analitics:Experienceno experience required                           .           
## Job_titledata engineer:Experienceno experience required                            .           
## Job_titledata modeler:Experienceno experience required                             .           
## Job_titledata scientist:Experienceno experience required                           .           
## Job_titleData scientist project manager:Experienceno experience required           0.0001729265
## Job_titledirector:Experienceno experience required                                 0.0034567215
## Job_titlemachine learning engineer:Experienceno experience required                .           
## Job_titleother data related jobs:Experienceno experience required                  0.0889889957
## Job_titleother scientist:Experienceno experience required                          .           
## Job_titledata analitics:Experiencesr                                               .           
## Job_titledata engineer:Experiencesr                                                .           
## Job_titledata modeler:Experiencesr                                                 .           
## Job_titledata scientist:Experiencesr                                               0.0604368288
## Job_titleData scientist project manager:Experiencesr                               .           
## Job_titledirector:Experiencesr                                                     .           
## Job_titlemachine learning engineer:Experiencesr                                    .           
## Job_titleother data related jobs:Experiencesr                                     -0.0324372378
## Job_titleother scientist:Experiencesr                                              0.1035644150
## Job_titledata analitics:Degreena                                                   .           
## Job_titledata engineer:Degreena                                                    .           
## Job_titledata modeler:Degreena                                                     0.3631055468
## Job_titledata scientist:Degreena                                                   0.0992442766
## Job_titleData scientist project manager:Degreena                                   .           
## Job_titledirector:Degreena                                                         .           
## Job_titlemachine learning engineer:Degreena                                        .           
## Job_titleother data related jobs:Degreena                                          .           
## Job_titleother scientist:Degreena                                                 -0.0453046611
## Job_titledata analitics:DegreeP                                                    .           
## Job_titledata engineer:DegreeP                                                     .           
## Job_titledata modeler:DegreeP                                                      .           
## Job_titledata scientist:DegreeP                                                    .           
## Job_titleData scientist project manager:DegreeP                                    .           
## Job_titledirector:DegreeP                                                          0.2612175942
## Job_titlemachine learning engineer:DegreeP                                         .           
## Job_titleother data related jobs:DegreeP                                           .           
## Job_titleother scientist:DegreeP                                                   .           
## Experienceno experience required:Degreena                                         -0.0595491137
## Experiencesr:Degreena                                                              .           
## Experienceno experience required:DegreeP                                           0.0435911779
## Experiencesr:DegreeP                                                               .           
## Job_titledata analitics:Experienceno experience required:Degreena                  .           
## Job_titledata engineer:Experienceno experience required:Degreena                  -0.0029401950
## Job_titledata modeler:Experienceno experience required:Degreena                    0.0012451030
## Job_titledata scientist:Experienceno experience required:Degreena                  .           
## Job_titleData scientist project manager:Experienceno experience required:Degreena  .           
## Job_titledirector:Experienceno experience required:Degreena                        .           
## Job_titlemachine learning engineer:Experienceno experience required:Degreena       .           
## Job_titleother data related jobs:Experienceno experience required:Degreena         0.1117304716
## Job_titleother scientist:Experienceno experience required:Degreena                 .           
## Job_titledata analitics:Experiencesr:Degreena                                      .           
## Job_titledata engineer:Experiencesr:Degreena                                       0.1233948832
## Job_titledata modeler:Experiencesr:Degreena                                        .           
## Job_titledata scientist:Experiencesr:Degreena                                      .           
## Job_titleData scientist project manager:Experiencesr:Degreena                      .           
## Job_titledirector:Experiencesr:Degreena                                            .           
## Job_titlemachine learning engineer:Experiencesr:Degreena                           .           
## Job_titleother data related jobs:Experiencesr:Degreena                             .           
## Job_titleother scientist:Experiencesr:Degreena                                    -0.1101920500
## Job_titledata analitics:Experienceno experience required:DegreeP                   .           
## Job_titledata engineer:Experienceno experience required:DegreeP                    .           
## Job_titledata modeler:Experienceno experience required:DegreeP                     .           
## Job_titledata scientist:Experienceno experience required:DegreeP                   0.0729664900
## Job_titleData scientist project manager:Experienceno experience required:DegreeP   .           
## Job_titledirector:Experienceno experience required:DegreeP                         0.0004680340
## Job_titlemachine learning engineer:Experienceno experience required:DegreeP        .           
## Job_titleother data related jobs:Experienceno experience required:DegreeP          .           
## Job_titleother scientist:Experienceno experience required:DegreeP                  .           
## Job_titledata analitics:Experiencesr:DegreeP                                       .           
## Job_titledata engineer:Experiencesr:DegreeP                                        .           
## Job_titledata modeler:Experiencesr:DegreeP                                         .           
## Job_titledata scientist:Experiencesr:DegreeP                                      -0.0007041066
## Job_titleData scientist project manager:Experiencesr:DegreeP                       .           
## Job_titledirector:Experiencesr:DegreeP                                             .           
## Job_titlemachine learning engineer:Experiencesr:DegreeP                            .           
## Job_titleother data related jobs:Experiencesr:DegreeP                              .           
## Job_titleother scientist:Experiencesr:DegreeP                                      0.0417117719
glmnet.logfit2
## glmnet 
## 
## 741 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 666, 667, 667, 667, 666, 667, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0003072974  0.2452159  0.5780582  0.1879434
##   0.10   0.0030729741  0.2442440  0.5783665  0.1873601
##   0.10   0.0307297408  0.2438513  0.5757182  0.1896072
##   0.55   0.0003072974  0.2443216  0.5799105  0.1873109
##   0.55   0.0030729741  0.2425565  0.5809966  0.1868317
##   0.55   0.0307297408  0.2585489  0.5422047  0.2029182
##   1.00   0.0003072974  0.2442717  0.5795284  0.1871899
##   1.00   0.0030729741  0.2416463  0.5828406  0.1867379
##   1.00   0.0307297408  0.2776270  0.4862482  0.2190794
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.003072974.
plot(glmnet.logfit2)

The best RMSE in the model 4 is 0.2416, which is about +/- 27% from the actual median salary. Notably the alpha is 1, which is more Lasso than gmnet approach. The lamnda or penalty term is the same as the model 3. The RMSE does not improve much in our opinion even with the new interaction terms.

library(rgl)
knitr::knit_hooks$set(webgl=hook_webgl)
plot(glmnet.logfit2$finalModel, xvar = "lambda", label = TRUE)

As we can see in the plot above, there are many coefficients are forced to be 0 when increasing the penalty term (alpha).

Model 5: Nonparametric model: knn + log transformation

library(caret)
set.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1) 
knn.fit1<-train(log(AvgSalary)~Job_title + Hourly + Job.Location + Python + Ownership_type +
                  sas + mongo + Experience + Degree,
               data=data,
               method="knn",
               trControl=fitControl
               )
opt.pen<-knn.fit1$finalModel$lambdaOpt 
knn.fit1
## k-Nearest Neighbors 
## 
## 741 samples
##   9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 668, 667, 665, 667, 668, 666, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE       Rsquared   MAE      
##   5  0.2702521  0.4806371  0.2063051
##   7  0.2720496  0.4748169  0.2100936
##   9  0.2716478  0.4756803  0.2106513
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

We also tried the nonparametric approach to see if they improved the salary prediction. Unfortunately, we are not able to try the tree models as the maximum amount of level in a categorical variables are 32. There are 37 different job locations in our data set. We will then use knn test to build the next two models. The difference between the model 5 and 6 is the addition of the interaction terms between the Job titles, the Experience and the degree.

#Model 6: KNN Fit

library(caret)
set.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1) 
knn.fit2<-train(log(AvgSalary)~ Hourly + Job.Location + Python + Ownership_type +
                  sas + mongo + Job_title* Experience * Degree,
               data=data,
               method="knn",
               trControl=fitControl
               )
opt.pen2<-knn.fit2$finalModel$lambdaOpt 
knn.fit2
## k-Nearest Neighbors 
## 
## 741 samples
##   9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 668, 667, 665, 667, 668, 666, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE       Rsquared   MAE      
##   5  0.2679873  0.4943566  0.2074422
##   7  0.2708543  0.4812245  0.2116457
##   9  0.2701284  0.4836510  0.2117632
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

Both model 5 and 6 gave similar results. Even though the addition of the interaction term brought down the RMSE, the difference is not significant( RMSE = 0.270 and 0.267 for model 5 nd 6 respectively). The model 4 with the glmnet, interaction terms still has the lowest RMSE of all.

#Initialize an randomly sampled list for prediction on AvgSalary, use each model to predict salary values

# Set seed 
set.seed(123)

# Number of samples to generate
num_samples <- 1  #We can adjust as needed

# Initialize an empty data frame for dummy data with the same columns as the original data
prediction_data <- data.frame(matrix(NA, nrow = num_samples, ncol = ncol(data)))
colnames(prediction_data) <- colnames(data)

# Loop through each column to generate random values (pulled from data)
for (col_name in names(data)) {
  if (col_name != "AvgSalary") {
    prediction_data[[col_name]] <- sample(data[[col_name]], num_samples, replace = TRUE)
  } else if (col_name == "AvgSalary") {
    # Set AvgSalary column to NULL (empty) for prediction
    prediction_data[[col_name]] <- rep(NA, num_samples)
  }
}
head(prediction_data)
##   Rating       Size   Ownership_type                 Sector
## 1    4.3 201 - 500  Company - Public Information Technology
##                            Revenue Hourly AvgSalary Job.Location Comp_history
## 1 $500 million to $1 billion (USD)      0        NA           TX           25
##   Python aws excel sql sas flink mongo google_an      Job_title Experience
## 1      0   1     1   1   0     0     0         0 data scientist         sr
##   Degree ML_skills vizualization data_framework log_Comp_history
## 1      P         0          none           none         2.302585
#Predict using GLMnet Model 3
prediction_data$Model3 <- predict(glmnet.logfit1, newdata = prediction_data)
prediction_data$Model3 <- exp(prediction_data$Model3)

#Predict using GLMnet Model 4
prediction_data$Model4 <- predict(glmnet.logfit2, newdata = prediction_data)
prediction_data$Model4 <- exp(prediction_data$Model4)


# Predict AvgSalary on the prediction_data using the knn.fit model
prediction_data$Model5 <- predict(knn.fit1, newdata = prediction_data)
prediction_data$Model5<- exp(prediction_data$Model5)


# Predict AvgSalary on the prediction_data using the knn.fit model
prediction_data$Model6 <- predict(knn.fit2, newdata = prediction_data)
prediction_data$Model6<- exp(prediction_data$Model6)

cat("Model 3 GLMNET Logged model without interaction,predicted an average salary of:", head(prediction_data$Model3), "\n")
## Model 3 GLMNET Logged model without interaction,predicted an average salary of: 118.9676
cat("Model 4 GLMNET Logged model with interaction, predicted an average salary of:", 
head(prediction_data$Model4), "\n")
## Model 4 GLMNET Logged model with interaction, predicted an average salary of: 108.845
cat("Model 5 KNN model predicted an average salary of:", head(prediction_data$Model5), "\n")
## Model 5 KNN model predicted an average salary of: 131.1116
cat("Model 6 KNN model predicted an average salary of:", head(prediction_data$Model6), "\n")
## Model 6 KNN model predicted an average salary of: 130.0268

We also created an example with random values of the prediction. Notably, the model 4 with the best RMSE gave us the lowest salary.

Conclusion

For the objective 1, there are a some predictors are both practically and statistically significant, such as job location, job titles and hourly. There are some predictors we thought that may matter to the average salary like the company size and the advance ML_skills, however, appeared to be not statistically significant.

After a few attempt with multiple technique, including adding interaction terms and nonparametric model, we still did not have a prediction as good as we hope for. Out of 4 different model with the increased complexity, the glmnet model with interaction terms has the lowest RMSE.

Given that the data set including the job postings scraped from Glassdoor, there is not much information how they were scraped or whether the data were collected randomly. There are some association between the predictors and the change in the median salary, including but not limited to the job locations, job title, Python requirement, etc. We assume that the results may be extended to the other data job postings on Glassdoor in 2021. However, if we want to extend the result to any data jobs, more job listings must be collected from different job sites, from other states, and so on.

For the future endeavors, we want to dig in more the methodology how the job listings were scraped from Glassdoor. We also want to address the weak correlation between the salary and the predictors. Maybe more the we need to collect better predictions that have a stronger relation with the salary.